Automated Credit Card Fraud Detection

The data analyzed here are a set of 284,807 credit card transactions, of which 492 are fraudulent. Each point has 30 features and a label. 28 of the features are the result of a PCA transformation and scaling, effectively anonymizing the transactions. If the PCA was fit on the full dataset before any train-test split, this may have introduced data leakage, but that cannot be determined without more information about how the transformation was performed. Here is where I sourced the data from.

The data are highly imbalanced, so we will need either to find a good way to artificially balance them or to use a method that is resistant to imbalance. We will do the former, for the sake of practice. Our evaluation metrics will have to be chosen with care as well.
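To make the degree of imbalance concrete, here is a quick sketch of the ratio computation on a hypothetical miniature stand-in for the real dataset (the toy counts below are invented; the real data have 492 frauds out of 284,807 transactions, about 0.17%):

```python
import pandas as pd

# Hypothetical miniature stand-in for the real dataset: 0 = genuine, 1 = fraud.
toy = pd.DataFrame({"Class": [0] * 997 + [1] * 3})

counts = toy["Class"].value_counts()
fraud_fraction = counts[1] / len(toy)

print(f"Fraud fraction: {fraud_fraction:.4f}")
```

A classifier that always predicts "genuine" would already score over 99% accuracy here, which is why accuracy is useless as a metric on this task.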

The packages required for this notebook are all available in the standard Anaconda 3 installation - with one exception. If you wish to run this yourself, you will need to install the imbalanced-learn package.

Library Imports

In [1]:
# Import necessary libraries.
#----------------------------------------------------------#

# General data storage and manipulation
import numpy as np
import pandas as pd
from os import listdir

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Preprocessing
from sklearn.preprocessing import RobustScaler
from imblearn.over_sampling import SMOTE

# Model selection
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# The models
from sklearn.linear_model import SGDClassifier
from imblearn.pipeline import make_pipeline, Pipeline

# Model evaluation
from sklearn.metrics import (roc_auc_score, make_scorer, roc_curve, precision_recall_curve, 
                             auc, average_precision_score)
from time import process_time

Data Imports

In [2]:
# Grab the data.
#----------------------------------------------------------#

filepaths = ["Data/" + f for f in listdir("./Data") if f.endswith('.csv')]
data = pd.concat(map(pd.read_csv, filepaths))

Preprocessing

There are a few minor issues with the data that we must address before moving on. The first is that two of the features - time and amount - have not yet been scaled. This could impact the performance of our model if not corrected. The second is that the feature matrix and label vector are combined. This is easily resolved.

In [3]:
# Scale the yet unscaled features.
#----------------------------------------------------------#

rob_scaler = RobustScaler()

time_scaled = rob_scaler.fit_transform(data['Time'].values.reshape(-1, 1))
amount_scaled = rob_scaler.fit_transform(data['Amount'].values.reshape(-1, 1))

data.drop(['Time', 'Amount'], axis = 1, inplace = True)

data.insert(0, 'time_scaled', time_scaled)
data.insert(1, 'amount_scaled', amount_scaled)
In [4]:
# Create the feature matrix and label vector.
#----------------------------------------------------------#

X = data.drop('Class', axis = 1)
y = data['Class']

Data Exploration

As always, it serves us well to take a look at the data before attempting to fit a model to it. To do so properly we will first randomly undersample the data. We will then produce a corner plot of the data and calculate the correlation between each feature pair.

In [23]:
# Generate the corner plot.
#----------------------------------------------------------#

plot_mat = pd.concat([data.where(data['Class'] == 0).dropna().sample(400), 
                      data.where(data['Class'] == 1).dropna().sample(400)])

g = sns.PairGrid(plot_mat, hue = "Class", vars = X.columns)
g = g.map_diag(sns.kdeplot, lw = 2, shade = True)
g = g.map_lower(sns.scatterplot, alpha = 0.6)
g = g.map_upper(sns.scatterplot, alpha = 0.6)
g = g.add_legend(title = "Class")

plt.savefig("Figures/Corner.png", bbox_inches = "tight");
In [30]:
# Generate the correlation plot.
#----------------------------------------------------------#

plot = plt.figure(figsize = (11.5, 10))
sns.heatmap(plot_mat.corr(), cmap = 'viridis')

plt.savefig("Figures/Correlations.png", bbox_inches = "tight");

Model Selection

We can see from the two figures above that classification on this data is not a lost cause. The positive and negative classes differ markedly in some feature marginal distributions, and class is correlated with several of our features. It is now time to select a model for the data. The selection metric I chose is average precision, which approximates the area under the precision-recall curve. It is worth noting that we are testing only smooth models, so that we can see the impact of different model sensitivities later on.
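For concreteness, here is a toy illustration of the selection metric (labels and scores below are made up for this example only). Average precision is a step-wise summary of the precision-recall curve, so it is close to, but not identical to, the trapezoidal AUC of the same curve:

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Invented labels and classifier scores for illustration.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 0, 1])
y_score = np.array([0.1, 0.7, 0.8, 0.2, 0.65, 0.9, 0.4, 0.15, 0.05, 0.55])

# Average precision: sum of precision at each recall step, weighted by recall gain.
ap = average_precision_score(y_true, y_score)

# Trapezoidal area under the PR curve, for comparison.
prec, rec, _ = precision_recall_curve(y_true, y_score)
prc_auc = auc(rec, prec)

print(ap, prc_auc)
```

Unlike the AUC-ROC, both quantities ignore true negatives entirely, which is exactly what we want when the negative class outnumbers the positive by three orders of magnitude.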

We will use a stratified shuffle split to build our training set, so that the class ratio is preserved in each split. We will also use the Synthetic Minority Oversampling Technique (SMOTE) to oversample the training set and balance the classes. In essence, this amounts to assuming that the class of fraudulent transactions is approximately simply connected in our feature space. If that assumption fails, model performance will suffer.
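The stratification step can be sketched on synthetic data (the 5% class ratio below is made up for illustration) to confirm that each split preserves the imbalance on both sides:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([0] * 950 + [1] * 50)  # 5% minority class, for illustration

splitter = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
for train_idx, test_idx in splitter.split(X_demo, y_demo):
    # Each fold keeps the 5% minority fraction in both the train and test sets.
    print(y_demo[train_idx].mean(), y_demo[test_idx].mean())
```

Note that SMOTE is applied only inside the pipeline, after the split: oversampling before splitting would leak synthetic copies of test-set frauds into the training set.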

In [41]:
# Find the best model parameters.
#----------------------------------------------------------#

# Instantiate the oversampler, kernel transform, and classifier.
sampler = SMOTE(n_jobs = -1)
clf = SGDClassifier(loss = 'log',  
                    max_iter = 1000, tol = 1e-3, n_jobs = -1)
pipeline = Pipeline(steps = [('sampler', sampler), 
                             ('clf', clf)])

# Create the parameter space to explore.
params = {'clf__penalty': ['l1', 'l2']}

# Define the cross validation strategy.
splitter = StratifiedShuffleSplit(n_splits = 10)

# Define the model selection metric.
scorer = make_scorer(average_precision_score, needs_proba = True)

# Instantiate and run the cross validation.
cv = GridSearchCV(estimator = pipeline, 
                  param_grid = params, 
                  scoring = scorer, 
                  n_jobs = -1,
                  cv = splitter)

model = cv.fit(X, y)

# Print the parameters of the best estimator.
print(model.best_estimator_)
Pipeline(memory=None,
         steps=[('sampler',
                 SMOTE(k_neighbors=5, kind='deprecated',
                       m_neighbors='deprecated', n_jobs=-1,
                       out_step='deprecated', random_state=None, ratio=None,
                       sampling_strategy='auto', svm_estimator='deprecated')),
                ('clf',
                 SGDClassifier(alpha=0.0001, average=False, class_weight=None,
                               early_stopping=False, epsilon=0.1, eta0=0.0,
                               fit_intercept=True, l1_ratio=0.15,
                               learning_rate='optimal', loss='log',
                               max_iter=1000, n_iter_no_change=5, n_jobs=-1,
                               penalty='l2', power_t=0.5, random_state=None,
                               shuffle=True, tol=0.001, validation_fraction=0.1,
                               verbose=0, warm_start=False))],
         verbose=False)

Model Evaluation

Now that we have selected a model it is time to find out how well it performs on the data. Our metrics of choice are the receiver operating characteristic, the precision-recall curve, and the areas under each. These should give some idea of real-world performance on highly imbalanced classification tasks.

Due to the low training time required (on the order of a few seconds) we will reshuffle and split the data, oversample, train, test, and evaluate 1000 times. This takes around two hours on a reasonably fast laptop.

In [42]:
# Fit the model with best parameters and get the ROC, PRC, 
# AUC-ROC, and AUC-PRC. Repeats 1000 times, using different 
# training and testing sets each time.
#----------------------------------------------------------#

# Define the train-test splitter.
splitter = StratifiedShuffleSplit(n_splits = 1000)

# Instantiate score storage.
roc_aucs = []
prc_aucs = []
precisions = []
recalls = []
fprs = []
tprs = []
times = []

for train, test in splitter.split(X, y):
    # Split the testing and training sets.
    X_train = X.iloc[train].values
    X_test = X.iloc[test].values
    y_train = y.iloc[train].values
    y_test = y.iloc[test].values
    
    # Fit the best model tested. Also oversamples.
    clf = model.best_estimator_
    start_time = process_time()
    clf.fit(X_train, y_train)
    end_time = process_time()
    y_pred = clf.predict_proba(X_test)[:, 1]
    
    # Determine the ROC and PRC.
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred)
    fpr, tpr, thresholds = roc_curve(y_test, y_pred)

    # Evaluate and store metrics.
    roc_aucs.append(roc_auc_score(y_test, y_pred))
    prc_aucs.append(auc(recall, precision))
    precisions.append(precision)
    recalls.append(recall)
    fprs.append(fpr)
    tprs.append(tpr)
    times.append(end_time - start_time)

Visualizing Model Performance

That done, we can now visualize the performance of our model. In the line charts, the lightly shaded region spans the 0.05 to 0.95 quantiles, and the darker shaded region the 0.25 to 0.75 quantiles. The line is the mean at each point.

In [43]:
# Graph the 5%, 25%, 50%, 75%, and 95% quantile PRCs.
#----------------------------------------------------------#

#  Perform linear interpolation to make PRC matrix.
int_recalls = np.linspace(0, 1, 300)
int_precisions = np.interp(x = int_recalls,
                           xp = recalls[0][::-1], 
                           fp = precisions[0][::-1])
for i in range(len(recalls) - 1):
    int_precision = np.interp(x = int_recalls,
                              xp = recalls[i + 1][::-1], 
                              fp = precisions[i + 1][::-1])
    int_precisions = np.vstack((int_precisions, int_precision))

# Find quantile curves.
lower_precision = np.quantile(int_precisions, 0.05, axis = 0)
low_precision = np.quantile(int_precisions, 0.25, axis = 0)
med_precision = np.median(int_precisions, axis = 0)
high_precision = np.quantile(int_precisions, 0.75, axis = 0)
higher_precision = np.quantile(int_precisions, 0.95, axis = 0)

# Plot curves with quantiles filled.
plt.figure(figsize = (8, 8))
plt.plot(int_recalls, med_precision)
plt.fill_between(int_recalls, lower_precision, higher_precision, alpha = 0.2)
plt.fill_between(int_recalls, low_precision, high_precision, alpha = 0.3, color = "C0")
plt.xlabel("Recall", fontsize = 14)
plt.ylabel("Precision", fontsize = 14)
plt.xticks(fontsize = 13)
plt.yticks(fontsize = 13)
plt.grid(True)
plt.xlim((-0.01, 1.01))
plt.ylim((-0.01, 1.01));

plt.savefig("Figures/PRCs.png", bbox_inches = "tight")
In [44]:
# Graph the 5%, 25%, 50%, 75%, and 95% quantile ROCs.
#----------------------------------------------------------#

#  Perform linear interpolation to make ROC matrix.
int_fprs = np.linspace(0, 1, 300)
int_tprs = np.interp(x = int_fprs,
                     xp = fprs[0], 
                     fp = tprs[0])
for i in range(len(fprs) - 1):
    int_tpr = np.interp(x = int_fprs,
                        xp = fprs[i + 1], 
                        fp = tprs[i + 1])
    int_tprs = np.vstack((int_tprs, int_tpr))

# Find quantile curves.
lower_tpr = np.quantile(int_tprs, 0.05, axis = 0)
low_tpr = np.quantile(int_tprs, 0.25, axis = 0)
med_tpr = np.median(int_tprs, axis = 0)
high_tpr = np.quantile(int_tprs, 0.75, axis = 0)
higher_tpr = np.quantile(int_tprs, 0.95, axis = 0)

# Plot curves with quantiles filled.
plt.figure(figsize = (8, 8))
plt.plot(int_fprs, med_tpr)
plt.fill_between(int_fprs, lower_tpr, higher_tpr, alpha = 0.2)
plt.fill_between(int_fprs, low_tpr, high_tpr, alpha = 0.3, color = "C0")
plt.xlabel("False Positive Rate", fontsize = 14)
plt.ylabel("True Positive Rate", fontsize = 14)
plt.xticks(fontsize = 13)
plt.yticks(fontsize = 13)
plt.grid(True)
plt.xlim((-0.01, 1.01))
plt.ylim((-0.01, 1.01));

plt.savefig("Figures/ROCs.png", bbox_inches = "tight")
In [45]:
# Plot a histogram of AUC-ROCs, binned with Freedman-Diaconis.
#----------------------------------------------------------#

plt.figure(figsize = (8, 8))
plt.hist(roc_aucs, bins = "fd", density = True)
plt.xlabel("ROC Area", fontsize = 14)
plt.xticks(fontsize = 13)
plt.yticks(fontsize = 13);

plt.savefig("Figures/AUCROCs.png", bbox_inches = "tight")

print("ROC Mean: {}".format(np.mean(roc_aucs)))
print("ROC Standard Deviation: {}".format(np.std(roc_aucs)))
ROC Mean: 0.9770445323177105
ROC Standard Deviation: 0.013774255705141911
In [46]:
# Plot a histogram of AUC-PRCs, binned with Freedman-Diaconis.
#----------------------------------------------------------#

plt.figure(figsize = (8, 8))
plt.hist(prc_aucs, bins = "fd", density = True)
plt.xlabel("PRC Area", fontsize = 14)
plt.xticks(fontsize = 13)
plt.yticks(fontsize = 13);

plt.savefig("Figures/AUCPRCs.png", bbox_inches = "tight")

print("PRC Mean: {}".format(np.mean(prc_aucs)))
print("PRC Standard Deviation: {}".format(np.std(prc_aucs)))
PRC Mean: 0.7592972800919404
PRC Standard Deviation: 0.05999534896522203
In [49]:
# Plot a histogram of the training times, binned with Freedman-Diaconis.
#----------------------------------------------------------#

plt.figure(figsize = (8, 8))
plt.hist(times, bins = "fd", density = True)
plt.xlabel("Training Time (s)", fontsize = 14)
plt.xticks(fontsize = 13)
plt.yticks(fontsize = 13);

plt.savefig("Figures/Times.png", bbox_inches = "tight")

print("Training Time Mean (s): {}".format(np.mean(times)))
print("Training Time Standard Deviation (s): {}".format(np.std(times)))
Training Time Mean (s): 4.935703125
Training Time Standard Deviation (s): 0.13176588995585836

Conclusions

Our model performs fairly well on the data, achieving roughly 0.8 recall at 0.8 precision in the median case while taking only a few seconds to train. The variation in performance across training sets suggests that results could be improved with data on more fraudulent transactions; the upper quantile curves indicate it may then be possible to reach 0.9 recall at 0.8 precision.

Future work could explore the performance impact of different undersampling, resampling, and oversampling techniques.

Because SGDClassifier supports incremental updates via partial_fit, this model can be used live without having to be fully refit, making it capable of real-time credit card fraud detection.
